Introduction

...

On Crowdflower, each revision is rated 10 times. The raters are given three questions:

Is this comment not English or not human readable?
- Column 'na'
How aggressive or friendly is the tone of this comment?
- Column 'how_aggressive_or_friendly_is_the_tone_of_this_comment'
- Ranges from '---' (Very Aggressive) to '+++' (Very Friendly)
Does the comment contain a personal attack or harassment? Please mark all that apply:
- Column 'is_harassment_or_attack'
- Raters can mark the comment as one or more of:
  - Targeted at the recipient of the message (e.g. "you suck"). ('recipient')
  - Targeted at a third party (e.g. "Bob sucks"). ('third_party')
  - Being reported or quoted (e.g. "Bob said Henri sucks"). ('quoting')
  - Another kind of attack or harassment. ('other')
  - This is not an attack or harassment. ('not_attack')
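
Because the third question is multi-select, a single answer in 'is_harassment_or_attack' can carry several labels at once. The sketch below shows one way to turn such an answer into per-label 0/1 indicators. It is not the `create_column_of_counts` helper from `crowdflower_analysis` (whose implementation is not shown here); `label_indicators` is a hypothetical name, and the newline-separated answer format is an assumption about the export.

```python
# Hypothetical helper, not part of crowdflower_analysis.
ATTACK_LABELS = ['not_attack', 'other', 'quoting', 'recipient', 'third_party']


def label_indicators(answer, labels=ATTACK_LABELS, sep='\n'):
    """Return {label: 0 or 1} for one rater's multi-select answer.

    Assumes the selected labels are joined by `sep` in a single string.
    Falsy answers (empty string, None, False) mean "nothing selected".
    """
    chosen = set(answer.split(sep)) if answer else set()
    return {label: int(label in chosen) for label in labels}


# A rater who ticked both 'recipient' and 'third_party'.
example = label_indicators('recipient\nthird_party')
print(example)
```

Applied to every row, this yields one indicator column per label, which is the shape the agreement computations below operate on.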

Loading packages and data



In [1]:

    
%load_ext autoreload
%autoreload 2
%matplotlib inline
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from crowdflower_analysis import *
from krippendorf_alpha import *
from krippendorf_alpha_grrrr import *



In [2]:

    
pd.set_option('display.max_colwidth', 1000)



In [3]:

    
dat = pd.read_csv('../../../../data/annotations/nda/nda onion layer 5 raters 10.csv')



In [4]:

    
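# Drop gold-standard (test) questions and keep only the real annotations.
dat = dat[dat['_golden'] == False]
# Replace missing values with the boolean False.
dat = dat.replace(np.nan, False, regex=True)
# Derive one column per attack label from the multi-select answer column.
attack_columns = ['not_attack', 'other', 'quoting', 'recipient', 'third_party']
for col in attack_columns:
    dat[col] = create_column_of_counts(dat['is_harassment_or_attack'], col)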



In [5]:

    
chosen_ids = set(dat['rev_id'].unique()[0:1000])



In [6]:

    
sub_dat = dat[dat['rev_id'].isin(chosen_ids)]



In [7]:

    
groups = sub_dat.groupby('_worker_id')



In [8]:

    
# Build one {rev_id: 'recipient' label} dict per worker,
# the input format passed to krippendorff_alpha below.
data = []
for _, worker_df in groups:
    data.append(dict(zip(worker_df['rev_id'], worker_df['recipient'])))



In [9]:

    
krippendorff_alpha(data, metric = nominal_metric)









    Out[9]:





0.45132419296394688
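
For readers without the imported `krippendorff_alpha` module at hand, here is a minimal sketch of Krippendorff's alpha with the nominal metric, taking the same list of per-rater dicts as built above. It is an independent illustration under stated assumptions, not the imported implementation: `nominal_alpha` is a hypothetical name, and returning NaN when alpha is undefined is a choice made here.

```python
from collections import Counter, defaultdict
from itertools import permutations


def nominal_alpha(data):
    """Krippendorff's alpha (nominal metric) for a list of {unit: value} dicts,
    one dict per rater. Units rated by fewer than two raters are ignored."""
    # Gather all values assigned to each unit.
    units = defaultdict(list)
    for rater in data:
        for unit, value in rater.items():
            units[unit].append(value)
    units = {u: v for u, v in units.items() if len(v) >= 2}

    n = sum(len(v) for v in units.values())  # number of pairable values
    if n == 0:
        return float('nan')  # nothing to compare

    totals = Counter()  # how often each value occurs overall
    d_o = 0.0           # observed disagreement
    for values in units.values():
        m = len(values)
        # Ordered pairs of differing values within the unit, weighted by 1/(m-1).
        mismatches = sum(1 for a, b in permutations(values, 2) if a != b)
        d_o += mismatches / float(m - 1)
        totals.update(values)
    d_o /= n

    # Expected disagreement if values were paired at random across all units.
    d_e = sum(totals[c] * totals[k]
              for c in totals for k in totals if c != k) / float(n * (n - 1))
    if d_e == 0:
        return float('nan')  # only one value ever used: alpha is undefined
    return 1.0 - d_o / d_e


# Two raters, ten units, binary labels; they disagree on units 1, 3, 6 and 9.
rater_a = dict(enumerate([0, 1, 0, 0, 0, 0, 0, 0, 1, 0], start=1))
rater_b = dict(enumerate([1, 1, 1, 0, 0, 1, 0, 0, 0, 0], start=1))
alpha = nominal_alpha([rater_a, rater_b])
print(round(alpha, 3))  # → 0.095
```

Alpha is 1 for perfect agreement and about 0 when agreement is no better than chance. The toy raters agree on six of ten units, yet alpha is only about 0.1 because label 0 dominates and most of that agreement is expected by chance.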



In [10]:

    
cleaned_df = clean_df(sub_dat)



In [11]:

    
Krippendorf_alpha(cleaned_df, ['not_attack_0', 'not_attack_1'])









    Out[11]:





0.47022523695316831



In [12]:

    
'''
for key in grouped_dat.keys():
    print "Krippendorf's Alpha (aggressiveness) for layer %s: " % key
    print Krippendorf_alpha(grouped_dat[key], aggressive_columns, distance = interval_distance)
    print "Krippendorf's Alpha (attack) for layer %s: " % key
    print Krippendorf_alpha(grouped_dat[key], ['not_attack_0', 'not_attack_1'])
'''








